-
Notifications
You must be signed in to change notification settings - Fork 16
feat: new eval schema changes #685
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…issues Address Copilot comments for coded evals
a3e9908 to
333821b
Compare
feat: wiring ExactMatch evaluator to new schema
feat: wiring JsonSimilarity evaluator to new schema
- Add version property detection to distinguish coded-evals from legacy files
- Update pull command to map coded-evals folder to local evals structure
- Update push command to upload files with version property to coded-evals folder
- Maintain backward compatibility with legacy evals folder structure
- Ensure eval command works out of the box with existing structure
fix: resolve eval set path for correct evaluator discovery
- Update load_eval_set to return both evaluation set and resolved path
- Fix evaluator discovery by using resolved path instead of original path
- Ensure eval command works with files in evals/eval-sets/ and evals/evaluators/
fix: cleaning up files
fix: address PR review comments
1. Move eval_set path resolution from runtime to CLI layer
- Resolve path in cli_eval.py before creating runtime
- Remove context update in runtime since path is already resolved
- Better separation of concerns
2. Clarify directory structure comments
- Make it explicit that os.path.join produces {self.directory}/evals/evaluators/
- Prevent confusion about directory paths
3. Add file deletion consistency for evaluation files
- Delete remote evaluation files when deleted locally
- Matches behavior of source file handling
- Ensures consistency across all file types
Addresses: #681 (review)
Addresses: #681 (review)
Addresses: #681 (comment)
feat: adding pull and push for coded-evals folder files
feat: wiring LLM judge evaluators to new schema
…_evals feat: Cherry-pick progress on parallelization of eval runs
feat: missing changes from llm eval wiring
feat: wire up trajectory eval
Add dedicated environment variable for eval endpoint routing with environment-aware
localhost detection using proper URL parsing, avoiding false positives and impact
to other services using UIPATH_URL.
Added `UIPATH_EVAL_BACKEND_URL` for eval-specific routing:
- Set to localhost URL (e.g., `http://localhost:8080`) for local development
- Leave unset or set to production URL for alpha/production environments
- Isolates eval endpoint routing from UIPATH_URL used by other services
**New Constant:**
- `ENV_EVAL_BACKEND_URL = "UIPATH_EVAL_BACKEND_URL"` in `constants.py`
**Updated Helper Method with Robust URL Parsing:**
- `_get_endpoint_prefix()` uses `urllib.parse.urlparse()` for accurate hostname detection
- Checks parsed hostname specifically (not substring matching)
- Prevents false positives like "notlocalhost.com" or "127.0.0.1.example.com"
- Returns `""` (empty) only when hostname is exactly "localhost" or "127.0.0.1"
- Returns `"agentsruntime_/"` for all other cases (including unset or parse failures)
- Handles edge cases: case-insensitive matching, ports, protocols
All 4 progress reporting endpoints now use `/coded/` path with conditional routing:
| Method | Endpoint Pattern |
|--------|------------------|
| PUT evalRun | `{prefix}api/execution/agents/{id}/coded/evalRun` |
| POST evalRun | `{prefix}api/execution/agents/{id}/coded/evalRun` |
| POST evalSetRun | `{prefix}api/execution/agents/{id}/coded/evalSetRun` |
| PUT evalSetRun | `{prefix}api/execution/agents/{id}/coded/evalSetRun` |
Where `{prefix}` is determined by `_get_endpoint_prefix()`:
- Localhost: `""` (empty - direct API access)
- Alpha/Prod: `"agentsruntime_/"` (service routing)
- `_update_eval_run_spec()` - Update eval run with results
- `_create_eval_run_spec()` - Create new eval run
- `_create_eval_set_run_spec()` - Create new eval set run
- `_update_eval_set_run_spec()` - Update eval set run completion
```bash
export UIPATH_EVAL_BACKEND_URL=http://localhost:8080
```
```bash
```
```bash
export UIPATH_EVAL_BACKEND_URL=https://alpha.uipath.com
```
✅ **Isolated Configuration:**
- Eval routing independent of UIPATH_URL
- Other services using UIPATH_URL remain unaffected
- Enables local eval testing without affecting other components
✅ **Robust URL Parsing:**
- Uses `urllib.parse.urlparse()` for accurate hostname extraction
- Prevents false positives from substring matching (e.g., "notlocalhost.com")
- Handles edge cases: ports, protocols, case sensitivity
- Graceful fallback on parsing errors
✅ **Simple & Explicit:**
- Single environment variable controls all eval endpoint routing
- Clear localhost detection (exact hostname match)
- Defaults to production routing when unset (safe fallback)
✅ **Backward Compatible:**
- No breaking changes to existing deployments
- Defaults to `agentsruntime_/` prefix when env not set
- Supports new `/coded/` evaluator API endpoints
✅ **Validation:**
- Syntax validation: Passed
- Logic verification: Passed (13 test scenarios including edge cases)
- Files changed: 2 modified (+29, -5 lines)
✅ **Environment Detection Tests:**
**Standard Cases:**
- `http://localhost:8080` → ✓ Empty prefix
- `http://127.0.0.1:3000` → ✓ Empty prefix
- `https://localhost` → ✓ Empty prefix
- `https://alpha.uipath.com` → ✓ `agentsruntime_/` prefix
- `https://cloud.uipath.com` → ✓ `agentsruntime_/` prefix
- Unset/empty → ✓ `agentsruntime_/` prefix (default)
**Edge Cases (False Positive Prevention):**
- `https://notlocalhost.com` → ✓ `agentsruntime_/` prefix (not localhost)
- `https://127.0.0.1.example.com` → ✓ `agentsruntime_/` prefix (not localhost)
- `https://mylocalhost.io` → ✓ `agentsruntime_/` prefix (not localhost)
**Case Sensitivity:**
- `http://LOCALHOST:8080` → ✓ Empty prefix (case-insensitive)
- `http://LocalHost:8080` → ✓ Empty prefix (case-insensitive)
**Problem:** Simple substring check `"localhost" in url` could match `"notlocalhost.com"`
**Solution:** Use `urlparse()` to extract exact hostname, preventing false positives
**Status:** Verified - documentation correctly references `UIPATH_EVAL_BACKEND_URL` throughout
- `src/uipath/_cli/_evals/_progress_reporter.py` (+29, -5)
- Added `urllib.parse.urlparse` import
- Improved `_get_endpoint_prefix()` with URL parsing logic
- `src/uipath/_utils/constants.py` (+1)
- Added `ENV_EVAL_BACKEND_URL` constant
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <[email protected]>
feat: wire tool evals, add mocked tool sample agent
refac: coded_evalutors -> evaluators, associated renames/changes
feat(evals): add dedicated UIPATH_EVAL_BACKEND_URL for localhost routing
feat: add support for custom evaluators
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TBH this PR is way too big to review in one shot. If the individual PRs have already been approved, please merge it as-is, and address my comments in a subsequent PR please.
| assertion_runs, evaluator_scores = self._collect_results( | ||
| sw_progress_item.eval_results, evaluators, spans or [] # type: ignore | ||
| ) | ||
| spec = self._update_eval_run_spec( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit:
evaluator_runs, evaluator_scores = ..
This way, you can remove the duplicate lines for spec = ...
| ) | ||
|
|
||
|
|
||
| class LegacyEvaluationItem(BaseModel): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please extend EvaluationItem. We can also get rid of AnyEvaluationItem.
| self.evaluations = selected_evals | ||
|
|
||
|
|
||
| class LegacyEvaluationSet(BaseModel): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please extend EvaluationSet.
| """ | ||
| if not evaluators: | ||
| return False | ||
| # Check the first evaluator type |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why only first?
| no_of_evals: int | ||
| evaluators: List[Any] | ||
| # skip validation to avoid abstract class instantiation | ||
| evaluators: SkipValidation[List[AnyEvaluator]] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
EvalSetRunCreatedEvent should not be a BaseModel IMO. Please switch to @dataclass instead..
| ) | ||
| return assertion_runs, evaluator_scores_list | ||
|
|
||
| def _collect_coded_results( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's avoid duplication of code? High overlap between this and _collect_results.
| Discriminator(_discriminate_eval_set), | ||
| ] | ||
|
|
||
| AnyEvaluationItem = Union[EvaluationItem, LegacyEvaluationItem] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's get rid of these two Any types..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, curious though what downside do you see with this ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My main issue is that currently LegacyEvaluationItem does not extend EvaluationItem but it semantically should. Any* for these purposes is an anti-pattern IMO.
|
|
||
| result.evaluation_time = execution_time | ||
| return result | ||
| class BaseEvaluator(BaseModel, Generic[T, C, J], ABC): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please use @dataclass from dataclasses instead of pydantic since types are not serializable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Regardless, this base class is overly complex. Have we over-engineered this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
CC @andrei-rusu
db9a265 to
7efe060
Compare
7efe060 to
73d8d30
Compare
73d8d30 to
602854e
Compare
35fc809 to
6cb6db5
Compare
These are required for build.
fix(TonOfFixes): lots of minor fixes
Model default is provided so updating the unit test.
| @@ -0,0 +1 @@ | |||
| print("to be implemented later") | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
dont forget about this
You can check the output file, see the logs
Printing output file...
=== OUTPUT FILE CONTENT ===
{
"output": {
"evaluationSetName": "Weather Tools Agent Evaluation",
"evaluationSetResults": [
then assert the json is like a schema or contains some known properties like `"evaluatorName": "ToolCallArgsEvaluator" with result and a score.
Done:
ToDo:
How to test:
example contains evaluator spec:
example eval set:
Development Package